10. Split Dataset for Model Development Quizzes

Split Dataset for Model Development Quizzes

Training set composition

If I am training a CNN to classify between no tumor, benign tumor, and malignant tumor on mammograms, what should be the makeup of my **training ** dataset for each of those three types?

SOLUTION: 33.33% no tumor, 33.33% benign tumor, 33.33% malignant tumor

Splitting training and validation data

I have 1,000 images that I can use to train and validate a CNN on to classify between the presence or absence of calcifications in a mammogram. In the real world, calcifications are only prevalent about 30% of the time, and my dataset of 1,000 images reflects this. In order to maximize my data and throw away as little as possible, how should I split my data using the 80-20 split rule?

SOLUTION: 240 images with calcifications in my training set, 60 with calcifications in my validation set